(54) Data processing with parallel or sequential execution of program instructions



(57) A data processing system 10 includes data processing circuitry 11 having circuitry (21, 23 and 25) for producing a set of instructions which include respective instruction portions for indicating whether the respective instructions can be executed simultaneously with another of the instructions. Program execution circuitry (29) receives the set of instructions and is selectively responsive to the instruction portions for executing simultaneously a plurality of the instructions indicated by the instruction portions.



FIG. 2 
Description

This invention relates generally to data processing and, more particularly but not exclusively, to data processing with both parallel and sequential execution of program instructions.

Data processing systems and data processors are used in myriad applications which in turn have an impact on virtually every aspect of life. The utility of these myriad applications can ordinarily be enhanced by increasing the speed and throughput of the associated data processing systems and data processors.

One way to enhance speed and throughput is, where possible, to execute program instructions in parallel rather than in sequential fashion. One known approach in this regard is to utilize a special mode instruction which specifies parallel or sequential execution of program instructions. Another known approach in this regard is to use a mask to specify null instructions in a parallel type packet. Another known solution is to perform data processing in a parallel mode only.

Although the aforementioned techniques are capable of improving speed and throughput, they are nevertheless undesirably difficult to implement and disadvantageously costly in terms of processing overhead.

It is therefore desirable to provide for parallel execution of program instructions in a manner which reduces the implementational difficulties and processing overhead associated with the above-described approaches.

An object of a preferred embodiment of the present invention is to utilize a portion of a given program instruction to determine whether that instruction can be executed simultaneously with another program instruction.

In general, and in a form of the present invention, a data processing device is provided which has circuitry for producing a set of instructions which include instruction portions for indicating whether the respective instructions can be executed simultaneously with another of the instructions. The data processing circuitry includes program execution circuitry connected to the producing circuitry for receiving the set of instructions and selectively executing simultaneously a plurality of the instructions in response to the instruction portions.

In another form of the present invention, a preferred embodiment comprises a method for operating a central processing unit (CPU) within a data processing device, the method comprising the steps of: providing a set of instructions with respective instruction portions for indicating whether the respective instructions can be executed simultaneously; and determining from the instruction portions whether a plurality of the instructions can be executed simultaneously.

Other embodiments of the present invention will be evident from the description and drawings.

Embodiments in accordance with the present invention will now be further described by way of example, with reference to the accompanying drawings in which:

Fig. 1 is a block diagram of a data processing system according to the present invention;

Fig. 2 is a block diagram of a portion of the data processing circuitry of Fig. 1;

Fig. 3 illustrates the basic format of an instruction packet utilized in the present invention;

Fig. 4 is one example of an instruction packet according to the format of Fig. 3;

Fig. 5 illustrates the execution sequence defined by the instruction packet of Fig. 4;

Fig. 6 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 7 illustrates the execution sequence defined by the instruction packet of Fig. 6;

Fig. 8 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 9 illustrates the execution sequence defined by the instruction packet of Fig. 8;

Fig. 10 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 11 illustrates the execution sequence defined by the instruction packet of Fig. 10;

Fig. 12 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 13 illustrates the execution sequence defined by the instruction packet of Fig. 12;

Fig. 14 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 15 illustrates the execution sequence defined by the instruction packet of Fig. 14;

Fig. 16 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 17 illustrates the execution sequence defined by the instruction packet of Fig. 16;

Fig. 18 illustrates another example of an instruction packet according to the format of Fig. 3;

Fig. 19 illustrates the execution sequence defined by the instruction packet of Fig. 18;

Figure 20 is a block diagram of a microprocessor which has an embodiment of the present invention;

Figure 21 is a block diagram of the execution units and register files of the microprocessor of Fig. 20;

Figure 22A is a chart which illustrates the processing phases of an instruction execution pipeline in the microprocessor of Fig. 20;

Figure 22B is a chart which illustrates the execution phases of the instruction execution pipeline in the microprocessor of Fig. 20;

Figure 23 is a timing diagram which illustrates timing details of processing an instruction fetch packet during the processing phases of Fig. 22A and execution of the execute packet during the execution phases of Fig. 22B;

Figure 24 is a block diagram showing instruction dispatching in the microprocessor of Fig. 20;

Figure 25 illustrates the basic format of an instruction fetch packet for the microprocessor of Fig. 20; and

Figure 26 illustrates an 8-word fetch packet that is partially parallel.

Corresponding numerals and symbols in the different figures and tables refer to corresponding parts unless otherwise indicated.

Fig. 1 is a block diagram of a data processing system 10 according to an exemplary embodiment of the present invention. The data processing system 10 includes data processing circuitry 11 and peripheral circuitries 13, 15, 17 and 19. In the exemplary embodiment of Fig. 1, the data processing circuitry 11 is connected to each of the peripheral circuitries 13, 15, 17 and 19 for transfer of information between data processing circuitry 11 and peripheral circuitries 13, 15, 17 and 19. However, and as will be apparent from the following description, a data processing system according to an embodiment of the present invention could include any quantity and type of peripheral circuitries and peripheral devices (such as peripherals 13, 15, 17 and 19) interconnected among themselves and with data processing circuitry 11 in any manner heretofore or hereafter conceivable to workers in the art.

Fig. 2 illustrates a portion of one exemplary embodiment of the data processing circuitry 11 of Fig. 1. In Fig. 2, fetch circuitry 23 accesses memory 25 at an address specified by program counter 21 and causes an instruction packet at that address to be loaded into instruction register 27. Program execution circuitry 29 decodes and executes the instructions of the instruction packet held in instruction register 27.

Fig. 3 illustrates the basic format of the instruction packet fetched from memory 25. In the disclosed exemplary embodiment, an instruction packet includes four 32-bit instructions A, B, C and D. As shown in Fig. 3, instructions A, B, C and D are stored at consecutive addresses in memory 25. Thus, during normal sequential execution of program instructions, instruction A would be executed first, followed sequentially by instructions B, C and D.

Bit 0 of each instruction in Fig. 3 has been designated as a p-bit. The p-bits define how the instructions will be executed. The p-bits of the Fig. 3 instruction packet are inspected from left to right by the program execution circuitry 29. If the p-bit of a given instruction is equal to logic one, then the next sequential instruction in the packet is to be executed in parallel with the first-mentioned instruction. Program execution circuitry 29 applies this rule until an instruction in the instruction packet is reached with a p-bit equal to logic 0.

If a given instruction has a p-bit of 0, then the next sequential instruction is executed sequentially after the given instruction (and after any instructions which are executed in parallel with the given instruction). The program execution circuitry 29 applies this rule until it reaches an instruction in the instruction packet with a p-bit of logic 1.
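The two rules above can be captured in a few lines. The following is a minimal sketch, assuming the p-bits of a packet are already available as a list of 0/1 values in address order; the function name `execute_packets` is hypothetical and not part of the described circuitry. It closes an execute packet at every p-bit of 0 and treats a trailing p-bit of 1 as malformed input, which is an assumption based on the compiler always giving the last instruction a p-bit of 0.

```python
from typing import List, Sequence


def execute_packets(p_bits: Sequence[int], names: Sequence[str]) -> List[List[str]]:
    """Group a packet's instructions into execute packets using the p-bit rule.

    p_bits[i] == 1: instruction i + 1 executes in parallel with instruction i.
    p_bits[i] == 0: instruction i + 1 executes in a later cycle.
    """
    if len(p_bits) != len(names):
        raise ValueError("need exactly one p-bit per instruction")
    packets: List[List[str]] = []
    current: List[str] = []
    for bit, name in zip(p_bits, names):
        current.append(name)
        if bit == 0:  # a p-bit of 0 ends the current parallel group
            packets.append(current)
            current = []
    if current:
        # Assumption: a trailing p-bit of 1 is treated as malformed input.
        raise ValueError("the last instruction must have a p-bit of 0")
    return packets


if __name__ == "__main__":
    abcd = ["A", "B", "C", "D"]
    print(execute_packets([0, 0, 0, 0], abcd))  # Fig. 4 pattern
    print(execute_packets([1, 0, 1, 0], abcd))  # Fig. 16 pattern
```

Run as a script, it prints `[['A'], ['B'], ['C'], ['D']]` for the Fig. 4 pattern and `[['A', 'B'], ['C', 'D']]` for the Fig. 16 pattern, matching the sequences of Figs. 5 and 17.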

Figs. 4-19 provide application examples of the above-described p-bit rules.

Fig. 4 illustrates an instruction packet in which all p-bits are 0. Thus, instructions A-D are executed sequentially as shown in Fig. 5.

Fig. 6 illustrates an instruction packet in which the p-bits of instructions A, B and C are equal to 1, and the p-bit of instruction D is 0. Thus, instructions A, B, C and D are executed simultaneously, that is, in parallel, as shown in Fig. 7.

In the instruction packet of Fig. 8, only the p-bit of instruction C is set to one, resulting in the execution sequence of Fig. 9, namely, instructions A and B are executed sequentially, followed by instructions C and D which are executed in parallel.

In the instruction packet of Fig. 10, only the p-bit of instruction B is set to one, resulting in the execution sequence shown in Fig. 11, namely instruction A is executed and then followed sequentially by the parallel execution of instructions B and C, which is then followed sequentially by execution of instruction D.

In the instruction packet of Fig. 12, the p-bits of instructions B and C are set to one, and the p-bits of instructions A and D are zero. This results in the execution sequence shown in Fig. 13, namely instruction A is executed and is then sequentially followed by the parallel execution of instructions B, C and D.

In the instruction packet of Fig. 14, only the p-bit of instruction A is set to logic one, resulting in the execution sequence shown in Fig. 15, namely instructions A and B are executed in parallel and then followed sequentially by the execution of instruction C and then the execution of instruction D.

In the instruction packet of Fig. 16, the p-bits of instructions A and C are set to one and the p-bits of instructions B and D are 0, resulting in the execution sequence illustrated in Fig. 17, namely the parallel execution of instructions A and B followed sequentially by the parallel execution of instructions C and D.

In the instruction packet of Fig. 18, the p-bits of instructions A and B are set to 1 and the p-bits of instructions C and D are 0. This results in the execution sequence illustrated in Fig. 19, namely instructions A, B and C are executed in parallel and then followed sequentially by execution of instruction D.

Because the instruction packet in the disclosed example includes 4 program instructions, the program compiler can always provide instruction D (the fourth instruction) with a p-bit of 0. The compiler determines the values of the remaining p-bits of instructions A, B and C based on the propriety of executing instructions A and B in parallel, the propriety of executing instructions B and C in parallel, and the propriety of executing instructions A, B and C in parallel. For example, if execution of instruction B requires a result provided by execution of instruction A, then the compiler would provide instruction A with a p-bit of 0 so that instruction B would be executed sequentially after instruction A. As another example, if instructions B and C access the same register, then the compiler would provide instruction B with a p-bit of 0 to ensure that instructions B and C are executed sequentially rather than in parallel.
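A compiler pass of this kind might look as follows. This is a hedged sketch, not the compiler the text refers to: the `Instr` record, the register names and the hazard test are hypothetical. The test is deliberately conservative: two instructions conflict if either writes a register the other reads or writes. The pass scans instructions in program order without reordering them and sets a p-bit only when the next instruction is independent of every instruction already in the current execute packet, which covers the A/B, B/C and A/B/C cases described above. Functional-unit limits are not modeled.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: name plus registers read and written."""
    name: str
    reads: FrozenSet[str] = field(default_factory=frozenset)
    writes: FrozenSet[str] = field(default_factory=frozenset)


def conflicts(a: Instr, b: Instr) -> bool:
    """Conservative test: either one writes a register the other touches."""
    return bool(a.writes & (b.reads | b.writes)) or bool(b.writes & a.reads)


def assign_p_bits(program: Sequence[Instr]) -> List[int]:
    """Greedy, order-preserving p-bit assignment for one instruction packet."""
    p_bits = [0] * len(program)
    packet: List[Instr] = []
    for i, instr in enumerate(program):
        if packet and not any(conflicts(prev, instr) for prev in packet):
            p_bits[i - 1] = 1  # chain instr into the current execute packet
            packet.append(instr)
        else:
            packet = [instr]  # start a new execute packet
    return p_bits  # the last p-bit is never set


if __name__ == "__main__":
    a = Instr("A", reads=frozenset({"R1"}), writes=frozenset({"R2"}))
    b = Instr("B", reads=frozenset({"R2"}), writes=frozenset({"R3"}))  # uses A's result
    c = Instr("C", reads=frozenset({"R4"}), writes=frozenset({"R5"}))
    d = Instr("D", reads=frozenset({"R6"}), writes=frozenset({"R7"}))
    print(assign_p_bits([a, b, c, d]))
```

With B reading the register that A writes, the sketch gives A a p-bit of 0 and lets B, C and D share one execute packet, yielding `[0, 1, 1, 0]`.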

Figure 20 is a block diagram of a microprocessor 1 which has an embodiment of the present invention. Microprocessor 1 is a VLIW digital signal processor ("DSP"). In the interest of clarity, Figure 20 only shows those portions of microprocessor 1 that are relevant to an understanding of an embodiment of the present invention. Details of general construction for DSPs are well known, and may be found readily elsewhere. For example, U.S. Patent 5,072,418 issued to Frederick Boutaud, et al, describes a DSP in detail and is incorporated herein by reference. U.S. Patent 5,329,471 issued to Gary Swoboda, et al, describes in detail how to test and emulate a DSP and is incorporated herein by reference. Details of portions of microprocessor 1 relevant to an embodiment of the present invention are explained in sufficient detail hereinbelow, so as to enable one of ordinary skill in the microprocessor art to make and use the invention.

In microprocessor 1 there are shown a central processing unit (CPU) 10, data memory 22, program memory 23, peripherals 60 and an external memory interface (EMIF) with a direct memory access (DMA) 61. CPU 10 further has an instruction fetch/decode unit 10a-c, a plurality of execution units, including an arithmetic and load/store unit D1, a multiplier M1, an ALU/shifter unit S1, an arithmetic logic unit ("ALU") L1, and a shared multiport register file 20a from which data are read and to which data are written. Decoded instructions are provided from the instruction fetch/decode unit 10a-c to the functional units D1, M1, S1, and L1 over various sets of control lines which are not shown. Data are provided to/from the register file 20a from/to load/store units D1 over a first set of busses 32a, to multiplier M1 over a second set of busses 34a, to ALU/shifter unit S1 over a third set of busses 36a and to ALU L1 over a fourth set of busses 38a. Data are provided to/from the memory 22 from/to the load/store units D1 via a fifth set of busses 40a. Note that the entire data path described above is duplicated with register file 20b and execution units D2, M2, S2, and L2. Instructions are fetched by fetch unit 10a from instruction memory 23 over a set of busses 41. Emulation unit 50 provides access to the internal operation of integrated circuit 1 which can be controlled by an external test system 51.

Note that the memory 22 and memory 23 are shown in Figure 20 to be a part of a microprocessor 1 integrated circuit, the extent of which is represented by the box 42. The memories 22-23 could just as well be external to the microprocessor 1 integrated circuit 42, or part of it could reside on the integrated circuit 42 and part of it be external to the integrated circuit 42. Also, an alternate number of execution units can be used.

When microprocessor 1 is incorporated in a data processing system, additional memory or peripherals may be connected to microprocessor 1, as illustrated in Figure 1. For example, Random Access Memory (RAM) 70, a Read Only Memory (ROM) 71 and a Disk 72 are shown connected via an external bus 73. Bus 73 is connected to the External Memory Interface (EMIF) which is part of functional block 61 within microprocessor 42. A Direct Memory Access (DMA) controller is also included within block 61. The DMA controller is generally used to move data between memory and peripherals within microprocessor 1 and memory and peripherals which are external to microprocessor 1.

Several example systems which can benefit from aspects of embodiments of the present invention are described in U.S. Patent 5,072,418, which was incorporated by reference herein, particularly with reference to Figures 2-18 of U.S. Patent 5,072,418. A microprocessor incorporating an aspect of an embodiment of the present invention to improve performance or reduce cost can be used to further improve the systems described in U.S. Patent 5,072,418. Such systems include, but are not limited to, industrial process controls, automotive vehicle systems, motor controls, robotic control systems, satellite telecommunication systems, echo canceling systems, modems, video imaging systems, speech recognition systems, vocoder-modem systems with encryption, and such.

A description of various architectural features of the microprocessor of Fig. 20 is provided in coassigned application serial number (TI docket number T-25311). A description of a complete set of instructions for the microprocessor of Fig. 20 is also provided in coassigned application serial number (TI docket number T-25311).

Figure 21 is a block diagram of the execution units and register files of the microprocessor of Fig. 20 and shows a more detailed view of the buses connecting the various functional blocks. In this figure, all data busses are 32 bits wide, unless otherwise noted. Bus 40a has an address bus DA1 which is driven by mux 200a. This allows an address generated by either load/store unit D1 or D2 to provide an address for loads or stores for register file 20a. Data bus LD1 loads data from an address in memory 22 specified by address bus DA1 to a register in load unit D1. Unit D1 may manipulate the data provided prior to storing it in register file 20a. Likewise, data bus ST1 stores data from register file 20a to memory 22. Load/store unit D1 performs the following operations: 32-bit add, subtract, linear and circular address calculations. Load/store unit D2 operates similarly to unit D1, with the assistance of mux 200b for selecting an address.

ALU unit L1 performs the following types of operations: 32/40 bit arithmetic and compare operations; left most 1, 0, bit counting for 32 bits; normalization count for 32 and 40 bits; and logical operations. ALU L1 has input src1 for a 32 bit source operand and input src2 for a second 32 bit source operand. Input msb_src is an 8 bit value used to form 40 bit source operands. ALU L1 has an output dst for a 32 bit destination operand. Output msb_dst is an 8 bit value used to form 40 bit destination operands. Two 32 bit registers in register file 20a are concatenated to hold a 40 bit operand. Mux 211 is connected to input src1 and allows a 32 bit operand to be obtained from register file 20a via bus 38a or from register file 20b via bus 210. Mux 212 is connected to input src2 and allows a 32 bit operand to be obtained from register file 20a via bus 38a or from register file 20b via bus 210. ALU unit L2 operates similarly to unit L1.
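The 40-bit operand formation can be illustrated with plain integer arithmetic. This is a minimal sketch, assuming the register-pair notation used later in this description (for example A5:A4, upper register first) and assuming that the 8 least significant bits of the upper register supply bits 39..32 (the msb_src/msb_dst role) while the lower register supplies bits 31..0. The function names are hypothetical, and signed interpretation of the 40-bit value is not modeled.

```python
from typing import Tuple

MASK8 = 0xFF
MASK32 = 0xFFFF_FFFF


def pair_to_long(upper_reg: int, lower_reg: int) -> int:
    """Form an unsigned 40-bit operand from a register pair such as A5:A4.

    Assumption: bits 39..32 come from the 8 LSBs of the upper register
    (msb_src role); bits 31..0 come from the lower register (src role).
    """
    return ((upper_reg & MASK8) << 32) | (lower_reg & MASK32)


def long_to_pair(value: int) -> Tuple[int, int]:
    """Split a 40-bit result back into (msb_dst, dst) style fields."""
    return (value >> 32) & MASK8, value & MASK32


if __name__ == "__main__":
    combined = pair_to_long(0x12, 0x89ABCDEF)
    print(hex(combined))
    print(long_to_pair(combined))
```

Bits of the upper register above bit 7 are simply ignored in this sketch, so `pair_to_long(0xFFFFFF7F, 0)` gives `0x7F00000000`.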

ALU/shifter unit S1 performs the following types of operations: 32 bit arithmetic operations; 32/40 bit shifts and 32 bit bit-field operations; 32 bit logical operations; branching; and constant generation. ALU S1 has input src1 for a 32 bit source operand and input src2 for a second 32 bit source operand. Input msb_src is an 8 bit value used to form 40 bit source operands. ALU S1 has an output dst for a 32 bit destination operand. Output msb_dst is an 8 bit value used to form 40 bit destination operands. Mux 213 is connected to input src2 and allows a 32 bit operand to be obtained from register file 20a via bus or from register file 20b via bus 210. ALU unit S2 operates similarly to unit S1, but can additionally perform register transfers to/from the control register file 102.

Multiplier M1 performs 16x16 multiplies. Multiplier M1 has input src1 for a 32 bit source operand and input src2 for a 32 bit source operand. Multiplier M1 has an output dst for a 32 bit destination operand. Mux 214 is connected to input src2 and allows a 32 bit operand to be obtained from register file 20a via bus 34a or from register file 20b via bus 210. Multiplier M2 operates similarly to multiplier M1.

Figure 22A is a chart which illustrates the processing phases of an instruction execution pipeline in the microprocessor of Fig. 20. Each phase corresponds roughly to a clock cycle of a system clock. For example, if microprocessor 1 is being operated at 200 MHz, then each phase is nominally 5 ns. However, in a phase where data is expected from a memory or peripheral, such as RAM 70, the pipeline will stall if the data is not ready when expected. When stalled, a given pipeline phase will exist for a number of system clock cycles.

In Figure 22A, the first phase of processing an instruction is to generate the program address in phase PG. This is done by loading a program fetch counter PFC which is located in control register file 102. During the second instruction processing phase PS, an address of an instruction fetch packet is sent to program memory 23 via a program address bus PADDR which is part of bus 41. The third phase PW is a wait phase to allow for access time in memory 23. During the fourth phase PR, a program fetch packet is available from program memory 23 via data bus PDATA_I which is part of bus 41. During the fifth processing phase DP, instruction parallelism is detected and instructions that can be executed are dispatched to the appropriate functional units. This aspect of pipeline operation will be described in more detail in later paragraphs. During the sixth processing phase DC, executable instructions are decoded and control signals are generated to control the various data paths and functional units.

Figure 22B is a chart which illustrates the execution phases of the instruction execution pipeline in the microprocessor of Fig. 20. During the first execution phase E1, single cycle instructions, referred to as "ISC", and branch instructions, referred to as "BR", are completed. A designated execution unit performs the operations indicated in Fig. 22B as directed by control circuitry 100. During the second execution phase E2, the following types of instructions are completed by designated execution units under control of control circuitry 100: integer multiply (IMPY), program store instructions (STP), and data store instructions (STD). During the third execution phase E3, execution of load data instructions (LD) continues by latching data from the data memory system (DMS), as indicated. During execution phase E4, the data latched in E3 is transferred to a data input register DDATA_I in execution unit D1 or D2. During execution phase E5, the LD instruction is completed by manipulating the data in register DDATA_I and writing the manipulated data to a specified register in register file 20a or 20b.
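The phase bookkeeping of Figs. 22A and 22B can be tallied as below. This is only a sketch under stated assumptions: one pipeline phase per system clock cycle, stalls modeled as whole extra cycles, and no account taken of overlap with other fetch packets. The phase names and the phase in which each instruction class completes come from the text; the table and function names are hypothetical.

```python
FETCH_PHASES = ["PG", "PS", "PW", "PR", "DP", "DC"]  # Fig. 22A
EXECUTE_PHASES = ["E1", "E2", "E3", "E4", "E5"]      # Fig. 22B

# Phase in which each instruction class completes, per the description of Fig. 22B.
COMPLETES_IN = {
    "ISC": "E1",
    "IMPY": "E2",
    "STP": "E2",
    "STD": "E2",
    "LD": "E5",
}


def completion_phase(kind: str) -> int:
    """1-based count of phases from PG up to and including the completing phase."""
    return len(FETCH_PHASES) + EXECUTE_PHASES.index(COMPLETES_IN[kind]) + 1


def nominal_ns(kind: str, clock_mhz: float = 200.0, stall_cycles: int = 0) -> float:
    """Nominal time assuming one phase per clock cycle plus whole stall cycles."""
    return (completion_phase(kind) + stall_cycles) * 1000.0 / clock_mhz


if __name__ == "__main__":
    for kind in ("ISC", "IMPY", "LD"):
        print(kind, completion_phase(kind), nominal_ns(kind))
```

At 200 MHz, for example, a load that completes in E5 has passed through 11 phases, or nominally 55 ns, while a single-cycle instruction completing in E1 takes 7 phases, or 35 ns; each stall cycle adds one more clock period.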

Figure 23 is a timing diagram which illustrates timing details of processing an instruction fetch packet during the processing phases of Fig. 22A and execution of the execute packet during the execution phases of Fig. 22B. Note that a pipe stall is illustrated in phase PW due to a program memory ready signal PRDY being low in phase PS, and a second pipe stall in phase E3 due to a data memory ready signal DRDY being low in phase E2.

Figure 24 is a block diagram showing instruction dispatching in the microprocessor of Fig. 20. In this embodiment, an instruction fetch packet contains eight instructions. Instruction fetch packet 1710 is dispatched and decoded to eight execution units as illustrated. Fetch packet 1720 contains a branch instruction 1725. Instruction fetch packet 1730 contains three instruction execute packets. The first execute packet contains two instructions, ZERO-SHL, which will begin processing in the first delay slot of branch instruction 1725. The second execute packet contains four instructions, ADD-SUB-STW-STW, which will begin processing in the second delay slot of branch instruction 1725. The third execute packet contains two instructions, including ADDK, which will begin processing in the third delay slot of branch instruction 1725.

Parallel Operations 

Instructions are always fetched eight at a time. This constitutes a fetch packet. The basic format of a fetch packet is shown in Figure 25. The execution grouping of the fetch packet is specified by the p-bit, bit zero, of each instruction. Fetch packets are 8-word aligned.

The p-bit controls the parallel execution of instructions. The p-bits are scanned from left to right (lower to higher address). If the p-bit of instruction i is 1, then instruction i + 1 is to be executed in parallel with (in the same cycle as) instruction i. If the p-bit of instruction i is 0, then instruction i + 1 is executed in the cycle after instruction i. All instructions executing in parallel constitute an execute packet. An execute packet can contain up to eight instructions. All instructions in an execute packet must use a unique functional unit.

An execute packet cannot cross an 8-word boundary. Therefore, the last p-bit in a fetch packet is always set to 0, and each fetch packet starts a new execute packet. As discussed with regard to Figures 4-19, there are three types of p-bit patterns for fetch packets. These three p-bit patterns result in the following execution sequences for the eight instructions: fully serial, fully parallel, and partially serial.
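Decoding a fetch packet at the word level can be sketched as follows, assuming byte addressing so that 8-word alignment means an address that is a multiple of 32 bytes. Only bit 0 of each 32-bit word is examined; the helper names and the placeholder upper bits in the usage line are hypothetical. The `classify` helper maps the result onto the three patterns named above.

```python
from typing import List, Sequence

WORDS_PER_FETCH = 8   # instructions per fetch packet
BYTES_PER_WORD = 4    # 32-bit instructions; byte addressing is assumed


def split_fetch_packet(address: int, words: Sequence[int]) -> List[List[int]]:
    """Split one fetch packet into execute packets of word indices 0..7."""
    if len(words) != WORDS_PER_FETCH:
        raise ValueError("a fetch packet holds exactly eight instructions")
    if address % (WORDS_PER_FETCH * BYTES_PER_WORD) != 0:
        raise ValueError("fetch packets must be 8-word aligned")
    if words[-1] & 1:
        raise ValueError("the last p-bit of a fetch packet must be 0")
    packets: List[List[int]] = []
    current: List[int] = []
    for index, word in enumerate(words):
        current.append(index)
        if (word & 1) == 0:  # p-bit (bit 0) clear: this execute packet ends here
            packets.append(current)
            current = []
    return packets


def classify(packets: List[List[int]]) -> str:
    """Name the fetch-packet pattern: fully parallel, fully serial or partially serial."""
    if len(packets) == 1:
        return "fully parallel"
    if all(len(packet) == 1 for packet in packets):
        return "fully serial"
    return "partially serial"
```

For the Figure 26 fetch packet, `split_fetch_packet(0x100, [0x1000 | b for b in (0, 0, 1, 1, 0, 1, 1, 0)])` returns `[[0], [1], [2, 3, 4], [5, 6, 7]]`, which `classify` labels partially serial.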

Example Parallel Code

The || characters signify that an instruction is to execute in parallel with the previous instruction. In the fetch packet of Figure 26, the code would be represented as this:

   instruction A
   instruction B
   instruction C
|| instruction D
|| instruction E
   instruction F
|| instruction G
|| instruction H
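The `||` convention can be generated mechanically from the p-bits. The sketch below assumes the Figure 26 p-bits are 0, 0, 1, 1, 0, 1, 1, 0 for instructions A through H, which is what the listing above implies rather than a value stated in the text; the function names are hypothetical. The second helper anticipates the branch rule given under the next heading: execution starts at the branch target, so earlier instructions in the same execute packet are skipped.

```python
from typing import List, Sequence


def render_listing(p_bits: Sequence[int], names: Sequence[str]) -> List[str]:
    """Prefix an instruction with '||' when the previous p-bit chains it in."""
    lines = []
    for i, name in enumerate(names):
        parallel = i > 0 and p_bits[i - 1] == 1
        lines.append(("|| " if parallel else "   ") + name)
    return lines


def execute_from(p_bits: Sequence[int], names: Sequence[str], target: int) -> List[List[str]]:
    """Execute packets issued when a branch lands on names[target].

    Instructions at lower addresses are skipped, even if they share the
    target's execute packet.
    """
    packets: List[List[str]] = []
    current: List[str] = []
    for i in range(target, len(names)):
        current.append(names[i])
        if p_bits[i] == 0:
            packets.append(current)
            current = []
    if current:  # only reachable if the final p-bit were (incorrectly) 1
        packets.append(current)
    return packets


if __name__ == "__main__":
    fig26 = [0, 0, 1, 1, 0, 1, 1, 0]  # p-bits implied by the Figure 26 listing
    print("\n".join(render_listing(fig26, [f"instruction {c}" for c in "ABCDEFGH"])))
    print(execute_from(fig26, list("ABCDEFGH"), target=3))  # branch to D
```

Branching to instruction D (index 3) yields `[['D', 'E'], ['F', 'G', 'H']]`: D and E issue together and C is ignored.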

Branching Into the Middle of an Execute Packet

If a branch into the middle of an execute packet occurs, all instructions at lower addresses are ignored. In the example in Figure 26, if a branch to the address containing instruction D occurs, then only D and E will execute. Even though instruction C is in the same execute packet, it is ignored. Instructions A and B are also ignored because they are in earlier execute packets.

Resource Constraints

No two instructions within the same execute packet can use the same resources. Also, no two instructions can write to the same register during the same cycle. The following sections describe each of the resources an instruction can use.

Functional Units 

Two instructions using the same functional unit cannot be issued in the same execute packet.
The following execute packet is invalid:

   ADD .S1 A0, A1, A2   ; \ .S1 is used for
|| SHR .S1 A3, 15, A4   ; / both instructions

The following execute packet is valid:

   ADD .L1 A0, A1, A2   ; \ Two different functional
|| SHR .S1 A3, 15, A4   ; / units are used
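A resource checker for this rule only needs to count unit specifiers. This is a minimal sketch, assuming one assembly line per instruction with the unit written as a separate token such as `.S1`; the helper names are hypothetical, and a trailing `X` cross-path marker is ignored so that `.L1X` and `.L1` count as the same unit.

```python
from collections import Counter
from typing import List, Sequence


def unit_of(line: str) -> str:
    """Return the functional unit named in an assembly line, e.g. '.S1'.

    Hypothetical helper: assumes the unit is a separate token starting with
    '.', and drops a trailing 'X' cross-path marker.
    """
    for token in line.replace(",", " ").split():
        if token.startswith("."):
            return token.upper().rstrip("X")
    raise ValueError(f"no functional unit in {line!r}")


def duplicate_units(packet: Sequence[str]) -> List[str]:
    """Units requested by more than one instruction of the execute packet."""
    counts = Counter(unit_of(line) for line in packet)
    return sorted(unit for unit, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(duplicate_units(["ADD .S1 A0, A1, A2", "SHR .S1 A3, 15, A4"]))  # invalid packet
    print(duplicate_units(["ADD .L1 A0, A1, A2", "SHR .S1 A3, 15, A4"]))  # valid packet
```

The invalid packet above reports `['.S1']`; the valid one reports an empty list.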

Cross Paths (1X and 2X)

One unit (either an .S, .L, or .M unit) per data path, per execute packet, can read a source operand from its opposite register file via the cross paths (1X and 2X). For example, .S1 can read both operands from the A register file, or one operand from the B register file using the 1X cross path. This is denoted by an X following the unit name.

Two instructions using the same X cross path between register files cannot be issued in the same execute packet since there is only one path from A to B and one path from B to A.
The following execute packet is invalid:

   ADD .L1X A0,B1,A1   ; \ 1X cross path is used
|| MPY .M1X A4,B4,A5   ; / for both instructions

The following execute packet is valid:

   ADD .L1X A0,B1,A1   ; \ Instructions use the 1X and
|| MPY .M2X A4,B4,B2   ; / 2X cross paths

The operand will come from a register file opposite of the destination if the x bit in the instruction field is set.
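The cross-path rule reduces to counting `X` suffixes per side. The sketch assumes, as in the examples above, that a unit specifier ending in `X` on side 1 uses the 1X path and one on side 2 uses the 2X path; the regular expression and function names are hypothetical.

```python
import re
from collections import Counter
from typing import List, Optional, Sequence

# Hypothetical pattern for unit specifiers such as '.L1', '.M2X', '.S1X'.
UNIT_RE = re.compile(r"\.([LSMD])([12])(X?)", re.IGNORECASE)


def cross_path(line: str) -> Optional[str]:
    """Return '1X' or '2X' if the unit specifier ends in X, else None."""
    match = UNIT_RE.search(line)
    if match is None:
        raise ValueError(f"no functional unit in {line!r}")
    side, uses_x = match.group(2), match.group(3)
    return f"{side}X" if uses_x else None


def cross_path_conflicts(packet: Sequence[str]) -> List[str]:
    """Cross paths claimed by more than one instruction in the execute packet."""
    counts = Counter(path for path in map(cross_path, packet) if path)
    return sorted(path for path, n in counts.items() if n > 1)


if __name__ == "__main__":
    print(cross_path_conflicts(["ADD .L1X A0,B1,A1", "MPY .M1X A4,B4,A5"]))  # invalid
    print(cross_path_conflicts(["ADD .L1X A0,B1,A1", "MPY .M2X A4,B4,B2"]))  # valid
```

The first packet reports `['1X']`, and the second reports no conflict because each cross path is used once.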
Load and Store Path 

Loads and stores can use an address pointer from one register file while loading to or storing from the other register file. Two loads and/or stores using an address pointer from the same register file cannot be issued in the same execute packet.

The following execute packet is invalid:

   LDW .D1 *A0,A1   ; \ Address registers from the
|| LDW .D1 *A2,B2   ; / same register file

The following execute packet is valid:

   LDW .D1 *A0,A1   ; \ Address registers from
|| LDW .D2 *B0,B2   ; / different register files

Two loads and/or stores loading to and/or storing from the same register file cannot be issued in the same execute packet.

The following execute packet is invalid:

   LDW .D1 *A4,A5   ; \ Loading to and storing from the
|| STW .D2 A6,*B4   ; / same register file

The following execute packet is valid:

   LDW .D1 *A4,B5   ; \ Loading to, and storing from,
|| STW .D2 A6,*B4   ; / different register files
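Both load/store rules can be checked by recording, for each instruction, which register file supplies the address pointer and which register file is loaded or stored. The parser below is a hypothetical sketch that handles only the two-operand `LDW`/`STW` form shown above, with exactly one operand starting with `*`; all names are illustrative.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def mem_files(line: str) -> Tuple[str, str]:
    """Return (address-pointer file, data file) for e.g. 'LDW .D1 *A0,A1'."""
    operands = line.split(None, 2)[2]  # 'LDW .D1 *A0,A1' -> '*A0,A1'
    addr_file = data_file = ""
    for op in operands.replace(" ", "").split(","):
        if op.startswith("*"):
            addr_file = op.lstrip("*+-")[0].upper()  # pointer register file
        else:
            data_file = op[0].upper()  # register loaded or stored
    if not addr_file or not data_file:
        raise ValueError(f"cannot parse {line!r}")
    return addr_file, data_file


def load_store_conflicts(packet: Sequence[str]) -> List[str]:
    """Describe violations of the two load/store path rules."""
    files = [mem_files(line) for line in packet]
    problems: List[str] = []
    for kind, idx in (("address pointer", 0), ("data", 1)):
        counts = Counter(f[idx] for f in files)
        problems += [f"two {kind} accesses use register file {rf}"
                     for rf, n in sorted(counts.items()) if n > 1]
    return problems
```

Applied to the four example packets above, only the two invalid ones produce a message: one about the address pointers in register file A, and one about data in register file A.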

Long Paths 

Only one long result may be written per cycle on each side of the register file. Because the .S and .L units share a read register port for long source operands and a write register port for long results, only one may be issued per side in an execute packet.

The following execute packet is invalid:

   ADD .L1 A5:A4,A1,A3:A2   ; \ Two long writes
|| SHL .S1 A8,A9,A7:A6      ; / on A register file

The following execute packet is valid:

   ADD .L1 A5:A4,A1,A3:A2   ; \ One long write for
|| SHL .S2 B8,B9,B7:B6      ; / each register file

Because the .L and .S units share their long read port with the store port, operations that read a long value cannot be issued on the .L and/or .S units in the same execute packet as a store.
The following execute packet is invalid:

   ADD .L1 A5:A4,A1,A3:A2   ; \ Long read operation and
|| STW .D1 A8,*A9           ; / a store

The following execute packet is valid:

   ADD .L1 A4,A1,A3:A2      ; \ No long read with
|| STW .D1 A8,*A9           ; / the store
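The two long-path rules can be sketched in the same style. This assumes that a register pair such as `A3:A2` marks a long operand, that the last operand is the destination, and that the unit number gives the side (1 for A, 2 for B). The store rule is applied to the whole execute packet, as the text states it; tracking it per register-file side would be a possible refinement. All names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def parse(line: str) -> Tuple[str, str, List[str]]:
    """Split 'ADD .L1 A5:A4,A1,A3:A2' into mnemonic, unit and operands."""
    mnemonic, unit, operands = line.split(None, 2)
    return mnemonic.upper(), unit.upper(), [op.strip() for op in operands.split(",")]


def long_path_conflicts(packet: Sequence[str]) -> List[str]:
    """Describe violations of the long-write and long-read/store rules."""
    long_writes: Counter = Counter()
    long_read = has_store = False
    for line in packet:
        mnemonic, unit, ops = parse(line)
        if mnemonic.startswith("ST"):
            has_store = True
        if unit[1] in "LS":  # only .L and .S units use the long ports here
            *sources, dest = ops
            if ":" in dest:  # register pair such as A3:A2
                long_writes[unit[2]] += 1  # '1' = A side, '2' = B side
            if any(":" in src for src in sources):
                long_read = True
    problems = [f"{n} long writes on side {side}"
                for side, n in sorted(long_writes.items()) if n > 1]
    if long_read and has_store:
        problems.append("long read on .L/.S together with a store")
    return problems
```

Of the four example packets in this section, the sketch flags only the two marked invalid.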
More than four reads of the same register cannot occur on the same cycle. Conditional registers are not included in this count.

The following code sequence is invalid:

   MPY .M1 A1,A1,A4   ; five reads of
|| ADD .L1 A1,A1,A5   ; register A1
|| SUB .D1 A1,A2,A3

Whereas this code sequence is valid:

        MPY .M1 A1,A1,A4   ; only four reads of
|| [A1] ADD .L1 A0,A1,A5   ; register A1
||      SUB .D1 A1,A2,A3
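Counting register reads per execute packet is straightforward once conditions are excluded. This is a minimal sketch, assuming three-operand instructions where every operand except the last is a read, and an optional `[reg]` or `[!reg]` prefix that, as the text specifies, is not counted; the parser and the limit constant are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

MAX_READS = 4  # limit stated in the text


def source_registers(line: str) -> List[str]:
    """Source operands of a line such as '[A1] ADD .L1 A0,A1,A5'."""
    line = line.strip()
    if line.startswith("["):
        line = line.split("]", 1)[1]  # conditional registers are not counted
    operands = line.split(None, 2)[2]
    return [op.strip() for op in operands.split(",")][:-1]  # drop destination


def over_read(packet: Sequence[str]) -> Dict[str, int]:
    """Registers read more than MAX_READS times in one execute packet."""
    counts = Counter(reg for line in packet for reg in source_registers(line))
    return {reg: n for reg, n in counts.items() if n > MAX_READS}
```

For the invalid sequence above the sketch returns `{'A1': 5}`; for the valid one, where the `[A1]` condition is excluded, it returns an empty dictionary.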
Multiple writes to the same register on the same cycle can occur if instructions with different latencies writing to the same register are issued on different cycles. For example, an MPY issued on cycle i followed by an ADD on cycle i+1 cannot write to the same register since both instructions will write a result on the same cycle. Therefore, the following code sequence is invalid:



   MPY .M1 A0,A1,A2
   ADD     A4,A5,A2
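The write-cycle arithmetic behind this example can be made explicit. The sketch assumes an MPY latency of 2 cycles and an ADD/SUB latency of 1 cycle, consistent with Fig. 22B (single-cycle instructions complete in E1, integer multiplies in E2), and computes the write cycle as issue cycle plus latency minus one. The latency table and function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# Assumed latencies in cycles: E1 completion = 1, E2 completion = 2.
LATENCY = {"ADD": 1, "SUB": 1, "MPY": 2}


def write_conflicts(schedule: Sequence[Tuple[int, str, str]]) -> Dict[Tuple[int, str], List[str]]:
    """Map (write cycle, register) -> mnemonics, keeping only clashes.

    Each schedule entry is (issue cycle, mnemonic, destination register).
    """
    writers: Dict[Tuple[int, str], List[str]] = defaultdict(list)
    for issue, mnemonic, dest in schedule:
        write_cycle = issue + LATENCY[mnemonic] - 1
        writers[(write_cycle, dest)].append(mnemonic)
    return {key: names for key, names in writers.items() if len(names) > 1}
```

An MPY issued on cycle 0 and an ADD issued on cycle 1, both targeting A2, map to `{(1, 'A2'): ['MPY', 'ADD']}`; moving the ADD to cycle 2 removes the conflict.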



Detectability of Write Conflicts

The following sequence of execute packets shows different multiple write conflicts. For example, the ADD and SUB in execute packet L1 write to the same register. This conflict is easily detectable.
L1:       ADD .L2 B5,B6,B7   ; \ detectable,
   ||     SUB .S2 B8,B9,B7   ; / conflict

L2:       MPY .M2 B0,B1,B2   ; \ not
L3:       ADD .L2 B3,B4,B2   ; / detectable

L4: [!B0] ADD .L2 B5,B6,B7   ; \ detectable,
   ||[B0] SUB .S2 B8,B9,B7   ; / no conflict

L5: [!B1] ADD .L2 B5,B6,B7   ; \ not
   ||[B0] SUB .S2 B8,B9,B7   ; / detectable

The MPY in packet L2 and the ADD in packet L3 might both write to B2 simultaneously; however, if a branch instruction causes the execute packet after L2 to be something other than L3, this would not be a conflict. Thus, the potential conflict in L2 and L3 might not be detected by the assembler. The instructions in L4 do not constitute a write conflict because they are mutually exclusive. In contrast, because it is not obvious that the instructions in L5 are mutually exclusive, the assembler cannot determine a conflict. If the pipeline does receive commands to perform multiple writes to the same register, the result is undefined.
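The three within-packet cases can be separated by inspecting only the predicates. This is a simplified sketch, not the assembler's actual logic: it assumes `[reg]` and `[!reg]` on the same register are the only provably exclusive pair, and it does not model conflicts that span execute packets, such as L2 and L3, which depend on control flow. The names are hypothetical.

```python
from typing import Optional, Tuple

# (register, negated) for '[B0]' / '[!B0]'; None means unconditional.
Condition = Optional[Tuple[str, bool]]


def parse_write(line: str) -> Tuple[Condition, str]:
    """Return the condition and destination register of a simple ALU line."""
    line = line.strip()
    condition: Condition = None
    if line.startswith("["):
        inner, line = line[1:].split("]", 1)
        inner = inner.strip()
        condition = (inner.lstrip("!"), inner.startswith("!"))
    destination = line.split(",")[-1].strip()
    return condition, destination


def classify_pair(first: str, second: str) -> str:
    """Classify two writes issued in the same execute packet."""
    (cond_a, dest_a), (cond_b, dest_b) = parse_write(first), parse_write(second)
    if dest_a != dest_b:
        return "no conflict"
    if cond_a is None and cond_b is None:
        return "conflict"  # both always execute: detectable conflict (as in L1)
    if cond_a and cond_b and cond_a[0] == cond_b[0] and cond_a[1] != cond_b[1]:
        return "no conflict"  # [!B0] vs [B0]: provably exclusive (as in L4)
    return "undetermined"  # exclusivity cannot be proven statically (as in L5)
```

Fed the L1, L4 and L5 pairs from the listing above, the sketch returns `"conflict"`, `"no conflict"` and `"undetermined"` respectively.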

Although exemplary embodiments of the present invention are described above, this does not limit the scope of the invention, which can be practiced in a variety of embodiments.

The scope of the present disclosure includes any novel feature or combination of features disclosed therein either explicitly or implicitly or any generalisation thereof irrespective of whether or not it relates to the claimed invention or mitigates any or all of the problems addressed by the present invention. The applicant hereby gives notice that new claims may be formulated to such features during the prosecution of this application or of any such further application derived therefrom. In particular, with reference to the appended claims, features from dependent claims may be combined with those of the independent claims in any appropriate manner and not merely in the specific combinations enumerated in the claims.



1. A data processing system, comprising:

data processing circuitry having circuitry for producing a set of instructions which include respective instruction portions for indicating whether the respective instructions can be executed simultaneously with another of the instructions; and

said data processing circuitry including program execution circuitry connected to said producing circuitry for receiving the set of instructions and selectively responsive to said instruction portions for executing simultaneously a plurality of said instructions indicated by said instruction portions.

2. A method of processing program instructions in a data processing system, comprising the steps of:

providing a set of instructions with respective instruction portions for indicating whether the respective instructions can be executed simultaneously with another of the instructions; and

determining from said instruction portions whether a plurality of said instructions can be executed simultaneously.

3. A method of compiling a program for execution by a data processing system, comprising the steps of:

determining whether a first program instruction can be executed simultaneously with a second program instruction that immediately sequentially follows the first program instruction in the program; and

providing the first instruction with an instruction portion that indicates whether the first instruction can be executed simultaneously with the second instruction.